I: Abstract

Have you ever finished listening to a playlist on Spotify, only to have completely unrelated songs begin playing one after another? If not, have you ever finished a playlist and wondered which songs are similar to the ones you just heard? We often find ourselves in a certain mood after listening to a specific type of music. For example, after listening to a pop playlist, we may be in a good mood or feeling somewhat energetic. For that reason, we would not want to start listening to slower-paced music, such as classical, shortly after finishing the pop playlist. Our group aims to solve this problem by finding a way to order another playlist by its similarity to the first playlist, so that the listener can stay in one continuous state of mind.


II: Introduction

Through this project, our group would like to create a simple recommender system that orders the songs of a second playlist by their similarity to a first playlist. We start by taking data from two playlists that one of our group members has on Spotify. We then do some introductory data analysis to determine whether the playlists have any similarities before ordering them. Next, we try to order the second playlist in terms of similarity to the first. One method that we attempt is to average the numerical variables of the songs in the first playlist, as a way to categorize the first playlist as one “type” of music. We then find the error between the numerical values of each song in the second playlist and the average values of the first. This allows us to see which songs are similar to the first playlist in terms of those variables. One big decision for us is which variables to use: we do not want to be too broad, but we want to include all the variables needed to determine what constitutes similarity.
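The averaging-and-error method described above can be sketched in a few lines of Python. This is a minimal illustration rather than our actual analysis code (the R version is in the Appendix): the function names, the three-feature subset, and the toy song values are all hypothetical.

```python
from statistics import mean

# Hypothetical subset of features; the real playlists have ten numeric columns.
FEATURES = ["bpm", "enrgy", "dance"]

def playlist_profile(songs, features=FEATURES):
    """Average each numeric feature over the reference playlist."""
    return {f: mean(song[f] for song in songs) for f in features}

def total_error(song, profile):
    """Sum of absolute differences between one song and the playlist averages."""
    return sum(abs(song[f] - avg) for f, avg in profile.items())

def order_by_similarity(candidates, reference, features=FEATURES):
    """Sort candidate songs from most to least similar to the reference playlist."""
    profile = playlist_profile(reference, features)
    return sorted(candidates, key=lambda song: total_error(song, profile))

# Toy data (made up for illustration only).
reference = [
    {"title": "ref a", "bpm": 100, "enrgy": 60, "dance": 70},
    {"title": "ref b", "bpm": 120, "enrgy": 80, "dance": 50},
]
candidates = [
    {"title": "slow ballad", "bpm": 70, "enrgy": 20, "dance": 30},
    {"title": "close match", "bpm": 112, "enrgy": 68, "dance": 61},
]

ordered = order_by_similarity(candidates, reference)
print([song["title"] for song in ordered])
```

Songs with a smaller total error sit closer to the first playlist's averages, so they appear earlier in the ordering.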


III: Data

The datasets that we will be using for this project are two of Tiffany Yin’s Spotify playlists. We obtained the data from a website known as Organizeyourmusic, which is linked here. This website was created by Paul Lamere, who builds music recommenders at Spotify itself. His Twitter is linked here. This specific website was created on August 6, 2016, during The Science of Music Hackathon in NYC. The website runs in conjunction with Spotify in order to give users data on their music tastes and playlists. After signing into their Spotify account on the website, users can get information on all of their playlists. For this project, we are using two of Tiffany’s playlists, which we titled playlist1 and playlist2 for simplicity. The first dataset has 55 observations, while the second has 67. Both datasets have 13 variables:

  • Title (title): The song name
  • Artist (artist): The music artist that the song belongs to, excluding features
  • Genre (genre): The genre of music that the song falls under
  • Beats Per Minute (bpm): The tempo of the song
  • Energy (enrgy): The energy of the song. The higher the value, the more energetic the song is
  • Danceability (dance): The higher the value, the easier it is to dance to the song
  • Loudness (dB): The higher the value, the louder the song
  • Liveness (live): The higher the value, the more likely the song is a live recording
  • Valence (val): The higher the value, the more positive mood is for the song
  • Duration (dur): The length of the song
  • Acousticness (acous): The higher the value, the more acoustic the song is
  • Speechiness (spch): The higher the value, the more spoken word the song contains
  • Popularity (pop): The higher the value, the more popular the song is

Many of these variables seem rather subjective, and we are not completely sure how they are all measured, but the creator is a credible source, so we are using his playlist program. All the variables in this dataset will be useful for this project, especially the numerical ones, which we use for most of the calculations that determine whether a song is similar to the other playlist. These numerical variables are BPM, energy, danceability, loudness, liveness, valence, duration, acousticness, speechiness, and popularity.
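Because the numerical variables sit on different scales (tempo in beats per minute, loudness in dB, and so on), variables with larger ranges can dominate a plain sum of differences. The sketch below is not part of our analysis. It shows one possible variation, assumed here for illustration, in which each difference is divided by the standard deviation of that variable in the first playlist. The function names and toy values are hypothetical.

```python
from statistics import mean, pstdev

def feature_stats(songs, features):
    """Mean and population standard deviation of each feature in a playlist."""
    stats = {}
    for f in features:
        values = [song[f] for song in songs]
        stats[f] = (mean(values), pstdev(values))
    return stats

def scaled_error(song, stats):
    """Sum of absolute differences, each measured in standard deviations."""
    total = 0.0
    for f, (avg, sd) in stats.items():
        if sd == 0:
            continue  # a constant feature cannot distinguish songs, so skip it
        total += abs(song[f] - avg) / sd
    return total

# Toy data (made up): 'dur' has a wider spread than 'bpm'.
reference = [
    {"bpm": 100, "dur": 200},
    {"bpm": 120, "dur": 240},
]
stats = feature_stats(reference, ["bpm", "dur"])

song_a = {"bpm": 120, "dur": 220}  # raw differences: 10 + 0
song_b = {"bpm": 110, "dur": 260}  # raw differences: 0 + 40
print(scaled_error(song_a, stats), scaled_error(song_b, stats))
```

With scaling, a song that is one standard deviation away in tempo counts the same as a song that is one standard deviation away in duration, regardless of the units.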


IV: Exploratory Data Analysis

This chart displays the number of songs within each genre of the first playlist. The first thing we noticed when glancing over this chart is that one of the genres is listed as “NA”, which we did not notice when looking at the raw dataset. The only reason we can think of for this label is that the program that converts a playlist into a dataframe could not identify the genre of this one particular song, which happens to be “All in Time.”

In addition, it is clear that the genre with the most songs in this playlist is k-pop.


The above chart shows the number of songs within each genre of the second playlist, which is the playlist that we are ordering based on the first playlist. The two playlists are similar in size: the first has 55 songs, while the second has 67. However, the second playlist has many more genres, with 30 genres compared to 18 in the first. This shows that there is a wider variety of music in the second playlist.

Another important piece of information that we can take away from these two charts is that k-pop is the most frequent genre in both playlists, with over 15 k-pop songs in playlist 1 and 10 in playlist 2. For this reason, we can assume that the two playlists are relatively similar, as both of them have the most songs in the same genre: k-pop. Therefore, when ordering the second playlist, we predict that the k-pop songs will be at the top of the list, as they should have the most similarity to the songs in the first playlist.
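The genre comparison above boils down to counting songs per genre, counting the distinct genres, and finding the most frequent one. A minimal sketch of that tally in Python follows. The function name and the toy genre list are hypothetical, and a missing genre is labeled “NA” on the assumption that this mirrors how the missing value appears in our chart.

```python
from collections import Counter

def genre_summary(genres):
    """Count songs per genre and report the most frequent genre."""
    # Treat a missing genre (None or empty string) as its own "NA" category.
    counts = Counter(genre if genre else "NA" for genre in genres)
    top_genre, top_count = counts.most_common(1)[0]
    return {
        "counts": counts,
        "n_genres": len(counts),
        "top_genre": top_genre,
        "top_count": top_count,
    }

# Toy genre column (made up for illustration).
toy_genres = ["k-pop", "k-pop", "pop", None, "hip hop", "k-pop"]
summary = genre_summary(toy_genres)
print(summary["n_genres"], summary["top_genre"], summary["top_count"])
```

Running the same summary on both playlists gives the numbers compared in this section: the number of genres in each playlist and the genre that appears most often.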


To get a better view of the data, we next organized the dataframes by artist. We grouped both playlists by artist to determine how many songs each artist had in each playlist. We then sorted the artists by number of songs and displayed the top 5.

Top 5 Artists in Playlist1

Artist        Number of Songs
----------    ---------------
88rising      8
DPR LIVE      8
Bazzi         3
pH-1          3
Rich Brian    3

Top 5 Artists in Playlist2

Artist                Number of Songs
------------------    ---------------
BLACKPINK             7
21 Savage             4
Friday Night Plans    4
Ariana Grande         3
Justin Hurwitz        3

As shown above, there is no overlap between the top artists of the two playlists: none of the top 5 artists in playlist 1 appear among the top 5 artists in playlist 2.
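The overlap check can also be done programmatically. The sketch below is illustrative only: the function names are hypothetical, and the artist lists are rebuilt from just the top-5 counts in the tables above rather than from the full playlists.

```python
from collections import Counter

def top_artists(artists, n=5):
    """Return the n artists with the most songs, as (artist, count) pairs."""
    return Counter(artists).most_common(n)

def shared_top_artists(artists_a, artists_b, n=5):
    """Artists that appear in the top n of both playlists."""
    top_a = {name for name, _ in top_artists(artists_a, n)}
    top_b = {name for name, _ in top_artists(artists_b, n)}
    return top_a & top_b

# One list entry per song, using only the counts reported in the two tables.
playlist1_artists = (["88rising"] * 8 + ["DPR LIVE"] * 8 + ["Bazzi"] * 3
                     + ["pH-1"] * 3 + ["Rich Brian"] * 3)
playlist2_artists = (["BLACKPINK"] * 7 + ["21 Savage"] * 4
                     + ["Friday Night Plans"] * 4 + ["Ariana Grande"] * 3
                     + ["Justin Hurwitz"] * 3)

shared = shared_top_artists(playlist1_artists, playlist2_artists)
print(shared)
```

An empty set here matches what the tables show: no artist is in the top 5 of both playlists.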




V: Analysis and Discussion

Top 10 Playlist2 Songs Most Similar to Playlist1

Title                                                   Artist             Genre              Error From Avg
----------------------------------------------------    ---------------    ---------------    --------------
Better - SG Lewis x Clairo                              SG Lewis           alternative r&b    83.74545
Don’t Know What To Do                                   BLACKPINK          k-pop              91.58182
This Could Be Us                                        Rae Sremmurd       hip hop            92.52727
New Rules                                               Dua Lipa           dance pop          96.58182
Rich Nigga Shit (feat. Young Thug)                      21 Savage          atl hip hop        100.90909
In My Feelings                                          Kehlani            pop                101.89091
I Didn’t Realize How Empty My Bed Was Until You Left    Roderick Porter    emo rap            102.65455
Heebiejeebies - Bonus                                   Aminé              hip hop            103.92727
all my friends                                          21 Savage          atl hip hop        104.58182
PLAYING WITH FIRE                                       BLACKPINK          k-pop              108.70909

Top 5 NovPlaylist Songs Most Similar to EDM Playlist

Title                                   Artist      Genre        Error From Avg
------------------------------------    --------    ---------    --------------
Fragile (feat. Melanie Fontana)         ARMNHMR     edm          70.09639
Waiting For You (feat. RUNN)            Trivecta    chillstep    115.67470
Love U Right - Yetep & Zephure Remix    Tritonal    big room     122.93976
Safe With Me (with Audrey Mika)         Gryffin     dance pop    125.43373
Better Off Lonely                       Nurko       edm          130.44578

VI: Conclusion


VII: Appendix

The first half contains all the R code, while the commented-out lines are Python code that our group wrote in another environment. All code is our own work.

library(tidyverse)
library(sf)
library(readr)
library(USAboundaries)
library(USAboundariesData)
library(rnaturalearth)
library(rnaturalearthdata)
library(scales)
playlist1 <- read_csv("~/github/dsclub-spotify-recommender/data/Spotify Playlist 1 - My Spotify Playlist-2.csv")

playlist2 <- read_csv("~/github/dsclub-spotify-recommender/data/Playlist 2 (make a queue for this playlist) - Sheet1.csv")

no_songs = playlist1 %>% 
  group_by(genre) %>% 
  summarize(Num_of_songs = n())

ggplot(data = no_songs, aes(x = genre, y = Num_of_songs)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(x = 'Genre',
       y = 'Number of Songs',
       title = 'Total Number of Songs in Each Genre in Playlist 1',
       caption = "Based on Tiffany's Playlist 1")

# Rename playlist2 columns so they match the names used in playlist1
playlist2 = playlist2 %>% 
  rename(
    enrgy = nrgy,
    dance = dnce,
    genre = 'top genre'
  )

no_songs2 = playlist2 %>% 
  group_by(genre) %>% 
  summarize(Num_of_songs = n())

ggplot(data = no_songs2, aes(x = genre, y = Num_of_songs)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(x = 'Genre',
       y = 'Number of Songs',
       title = 'Total Number of Songs in Each Genre in Playlist 2',
       caption = "Based on Tiffany's Playlist 2")
topartists1 = playlist1 %>% 
  group_by(artist) %>% 
  summarise(Num_songs = n()) %>% 
  arrange(-Num_songs) %>% 
  head(5)

knitr::kable(topartists1, format = "simple", caption = "Top 5 Artists in Playlist1",
  col.names = c("Artist", "Number of Songs"),
  format.args = list(big.mark = ",", scientific = FALSE))
topartists2 = playlist2 %>% 
  group_by(artist) %>% 
  summarise(Num_songs = n()) %>% 
  arrange(-Num_songs) %>% 
  head(5)

knitr::kable(topartists2, format = "simple", caption = "Top 5 Artists in Playlist2",
  col.names = c("Artist", "Number of Songs"),
  format.args = list(big.mark = ",", scientific = FALSE))

# Average of each numerical variable over playlist1
avgbpm = mean(playlist1$bpm)
avgenrgy = mean(playlist1$enrgy)
avgdance = mean(playlist1$dance)
avgdB = mean(playlist1$dB)
avglive = mean(playlist1$live)
avgval = mean(playlist1$val)
avgdur = mean(playlist1$dur)
avgacous = mean(playlist1$acous)
avgspch = mean(playlist1$spch)
avgpop = mean(playlist1$pop)

playlist2a = playlist2

playlist2a$bpmavg = avgbpm
playlist2a$enrgyavg = avgenrgy
playlist2a$danceavg = avgdance
playlist2a$dBavg = avgdB
playlist2a$liveavg = avglive
playlist2a$valavg = avgval
playlist2a$duravg = avgdur
playlist2a$acousavg = avgacous
playlist2a$spchavg = avgspch
playlist2a$popavg = avgpop
playlist2a = playlist2a %>% 
  mutate(bpmdiff = abs(bpm - bpmavg)) %>% 
  mutate(enrgydiff = abs(enrgy - enrgyavg)) %>% 
  mutate(dancediff = abs(dance - danceavg)) %>% 
  mutate(dBdiff = abs(dB - dBavg)) %>% 
  mutate(livediff = abs(live - liveavg)) %>% 
  mutate(valdiff = abs(val - valavg)) %>% 
  mutate(durdiff = abs(dur - duravg)) %>% 
  mutate(acousdiff = abs(acous - acousavg)) %>% 
  mutate(spchdiff = abs(spch - spchavg)) %>% 
  mutate(popdiff = abs(pop - popavg))

playlist2a = playlist2a %>% 
  mutate(totaldiff = bpmdiff + enrgydiff + dancediff + dBdiff + livediff + valdiff + durdiff + acousdiff + 
           spchdiff + popdiff)

arrangedplaylist2 = playlist2a %>% 
  arrange(totaldiff) %>% 
  select(title, artist, genre, totaldiff) %>% 
  head(10)

knitr::kable(arrangedplaylist2, format = "simple", caption = "Top 10 Playlist2 Songs Most Similar to Playlist1",
  col.names = c("Title", "Artist", "Genre", "Error From Avg"),
  format.args = list(big.mark = ",", scientific = FALSE))
edmplaylist <- read_csv("~/github/dsclub-spotify-recommender/data/edmplaylist - Sheet1.csv")

novplaylist <- read_csv("~/github/dsclub-spotify-recommender/data/novplaylist - Sheet1.csv")

no_songs_edm = edmplaylist %>% 
  group_by(genre) %>% 
  summarize(Num_of_songs = n())

ggplot(data = no_songs_edm, aes(x = genre, y = Num_of_songs)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(x = 'Genre',
       y = 'Number of Songs',
       title = 'Total Number of Songs in Each Genre in EDM Playlist',
       caption = "Based on EDM Playlist")
no_songs_nov = novplaylist %>% 
  group_by(genre) %>% 
  summarize(Num_of_songs = n())

ggplot(data = no_songs_nov, aes(x = genre, y = Num_of_songs)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  labs(x = 'Genre',
       y = 'Number of Songs',
       title = 'Total Number of Songs in Each Genre in November Playlist',
       caption = "Based on November Playlist")
avgedmbpm = mean(edmplaylist$bpm)
avgedmenrgy = mean(edmplaylist$enrgy)
avgedmdance = mean(edmplaylist$dance)
avgedmdB = mean(edmplaylist$dB)
avgedmlive = mean(edmplaylist$live)
avgedmval = mean(edmplaylist$val)
avgedmdur = mean(edmplaylist$dur)
avgedmacous = mean(edmplaylist$acous)
avgedmspch = mean(edmplaylist$spch)
avgedmpop = mean(edmplaylist$pop)

novplaylist1a = novplaylist

novplaylist1a$bpmavg = avgedmbpm
novplaylist1a$enrgyavg = avgedmenrgy
novplaylist1a$danceavg = avgedmdance
novplaylist1a$dBavg = avgedmdB
novplaylist1a$liveavg = avgedmlive
novplaylist1a$valavg = avgedmval
novplaylist1a$duravg = avgedmdur
novplaylist1a$acousavg = avgedmacous
novplaylist1a$spchavg = avgedmspch
novplaylist1a$popavg = avgedmpop

novplaylist1a = novplaylist1a %>% 
  mutate(bpmdiff = abs(bpm - bpmavg)) %>% 
  mutate(enrgydiff = abs(enrgy - enrgyavg)) %>% 
  mutate(dancediff = abs(dance - danceavg)) %>% 
  mutate(dBdiff = abs(dB - dBavg)) %>% 
  mutate(livediff = abs(live - liveavg)) %>% 
  mutate(valdiff = abs(val - valavg)) %>% 
  mutate(durdiff = abs(dur - duravg)) %>% 
  mutate(acousdiff = abs(acous - acousavg)) %>% 
  mutate(spchdiff = abs(spch - spchavg)) %>% 
  mutate(popdiff = abs(pop - popavg))

novplaylist1a = novplaylist1a %>% 
  mutate(totaldiff = bpmdiff + enrgydiff + dancediff + dBdiff + livediff + valdiff + durdiff + acousdiff + 
           spchdiff + popdiff)

novplaylist1 = novplaylist1a %>% 
  arrange(totaldiff) %>% 
  select(title, artist, genre, totaldiff) %>% 
  head(5)

knitr::kable(novplaylist1, format = "simple", caption = "Top 5 NovPlaylist Songs Most Similar to EDM Playlist",
  col.names = c("Title", "Artist", "Genre", "Error From Avg"),
  format.args = list(big.mark = ",", scientific = FALSE))

# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# 
# f, axes = plt.subplots(2, 5, figsize=(12, 7), tight_layout = True)
# plt.suptitle('Distplot for Numerical Variables in Playlist1', fontsize = 20)
# sns.distplot(playlist1["bpm"] , color="skyblue", ax=axes[0, 0])
# sns.distplot(playlist1["energy"] , color="olive", ax=axes[0, 1])
# sns.distplot(playlist1["dance"] , color="gold", ax=axes[1, 0])
# sns.distplot(playlist1["dB"] , color="teal", ax=axes[1, 1])
# sns.distplot(playlist1["live"] , color="green", ax=axes[1, 2])
# sns.distplot(playlist1["val"] , color="orange", ax=axes[0, 2])
# sns.distplot(playlist1["dur"] , color="blue", ax=axes[0, 3])
# sns.distplot(playlist1["acous"] , color="red", ax=axes[0, 4])
# sns.distplot(playlist1["spch"] , color="purple", ax=axes[1, 3])
# sns.distplot(playlist1["pop"] , color="yellow", ax=axes[1, 4])
# plt.tight_layout()
# plt.show()
# 
# fig, axes = plt.subplots(2, 5, figsize=(12, 7), tight_layout=True)
# plt.suptitle('Histogram for Numerical Variables in Playlist1', fontsize = 20)
# playlist1.hist('bpm', bins=10, ax=axes[0,0])
# playlist1.hist('energy', bins=10, ax=axes[0,1])
# playlist1.hist('dance', bins=10, ax=axes[0,2])
# playlist1.hist('dB', bins=10, ax=axes[0,3])
# playlist1.hist('live', bins=10, ax=axes[0,4])
# playlist1.hist('val', bins=10, ax=axes[1,0])
# playlist1.hist('dur', bins=10, ax=axes[1,1])
# playlist1.hist('acous', bins=10, ax=axes[1,2])
# playlist1.hist('spch', bins=10, ax=axes[1,3])
# playlist1.hist('pop', bins=10, ax=axes[1,4])
# plt.tight_layout()
# plt.show()
library(icon)
fa("globe", size = 5, color="green")